(57) Abstract: A music retrieval system which takes an input melody (20) as the query. In one embodiment, changes or differences
in the distribution of energy across the frequency spectrum over time are used to find breakpoints (125) in the input melody in order
to separate it into distinct notes (135). In another embodiment the breakpoints are identified based on changes in pitch over time.
A confidence level is preferably associated with each breakpoint and/or note extracted from the input melody. The confidence level
is based on one or more of: changes in pitch; absolute values of a spectral energy distribution indicator; and the energy level of the
input melody.
MELODY RETRIEVAL SYSTEM
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject
to copyright protection. The copyright owner has no objection to the facsimile reproduction
by anyone of the patent document or the patent disclosure, as it appears in the Patent and
Trademark Office patent files or records, but otherwise reserves all copyright rights
whatsoever.

RELATED APPLICATIONS 

This application claims priority from U.S. provisional application serial no.
60/188,730, entitled Humming Search Music Recognition System, filed March 13, 2000,
which application is hereby incorporated herein by reference.

FIELD OF INVENTION 

The invention relates to the field of music retrieval systems and more particularly to 
retrieval systems which take a melody vocalized by a user as the query. 

BACKGROUND OF INVENTION 

With the proliferation of musical databases now available, e.g., through the Internet or
jukebox machines, consumers now have ready access to individual songs or pieces of music
available for purchase or listening. However, being surrounded by so much music, it is often
difficult for a listener to catch or remember the title of a song or the artist's name.
Nevertheless, if the song is of interest to the listener, he or she can often remember at least a
portion of its musical melody. The following disclose retrieval of information relating to
audio data from a hummed or sung melody taken as a query: U.S. Patent No. 6,121,530
(Sonoda); A. Ghias, J. Logan, D. Chamberlin, B.C. Smith, Query by Humming, Musical
Information Retrieval in an Audio Database, Multimedia '95, San Francisco, pp. 231-236; N.
Kosugi, Y. Nishihara, S. Kon'ya, M. Yamamuro, K. Kushima, Music Retrieval by Humming,
Using Similarity Retrieval over High Dimensional Feature Vector Space, 1999 IEEE Pacific
Rim Conference on Communications, Computers and Signal Processing, Page(s) 404-407;
and P.V. Rolland, Raskinis, J-G Ganascia, Musical Content-based Retrieval: an overview of
the Melodiscov Approach and System.

The invention provides an approach different from those described in the above-mentioned
documents in identifying a musical composition in response to a query that is a
melody.
SUMMARY OF INVENTION
The invention provides methods and systems for retrieving musical selections or data
identifying musical selections based on a digital version of a melody which originated from a
sound or electronic source, e.g., a person humming, singing, whistling or otherwise
vocalizing the melody; a musical instrument's audio or electronic output; an analog or digital
recording of the melody, etc. Breakpoints between notes are identified, as are distinct notes
represented by pitch. In addition, one or more confidence levels may be associated with the
input melody.

A value or confidence level may be assigned to each breakpoint to provide a measure
of confidence that the identified breakpoint is in fact a breakpoint. Similarly, a value or
confidence level assigned to each note may provide a measure of confidence that the 
identified note is a single note, e.g., does not include two or more notes. 

One aspect of the invention provides a method and system for converting a digitized
melody into a series of notes. The method and system receive a digitized representation of an
input melody, identify breakpoints in the melody in order to define notes therein, determine a
pitch and beat duration for each note of the melody, and associate a confidence level with
each breakpoint, or each note, or both.

The confidence levels associated with breakpoints and/or notes may be determined
using different techniques, some of which are described herein. 

In the preferred embodiment, segmentation of the input melody into distinct notes
divided by breakpoints is based on changes or differences in the distribution of energy across 
the frequency spectrum over time. The confidence levels associated with each breakpoint 
and/or note may be based on changes in pitch, as well as absolute and relative values of a 
spectral energy distribution indicator. 

One aspect of the invention provides a method and related system for converting a
digitized melody into a sequence of notes. Generally speaking, the method involves
estimating breakpoints in the input melody based on changes in the distribution of energy
across the frequency spectrum over time. In the preferred embodiment, the melody is
segmented into a series of frames. A spectral energy distribution (SED) indicator is
computed for each frame and at least initial breakpoint estimates are derived based on the
SED indicator. Notes are defined between adjacent breakpoints.

Another aspect of the invention provides another method and related system for 
converting a digitized melody into a sequence of notes. The method includes: segmenting the
melody into a series of frames; computing the auto-correlation of each frame; estimating the
pitch of each frame based on (i) a pitch period corresponding to a shift where the
auto-correlation coefficient associated with the frame is relatively large and (ii) the closeness of
the pitch estimate to estimates in one or more adjacent frames; and estimating breakpoints in
the melody based on changes in the pitch estimates, wherein the notes are defined between
adjacent breakpoints.

Another aspect of the invention provides a method and related system for identifying
breakpoints in a digitized melody. The method includes: segmenting the melody into a series 
of frames; computing the auto-correlation of each frame; estimating the pitch of each frame 
based on (i) a pitch period corresponding to a shift where the auto-correlation coefficient 
associated with the frame is relatively large and (ii) the closeness of the pitch estimate to 
estimates in one or more adjacent frames; determining regions of said melody where pitch
estimates are likely to be invalid; and identifying the breakpoints in the melody based on
transitions between frames having valid pitch estimates and frames having invalid pitch
estimates.

Other aspects of the invention relate to methods and systems for determining
confidence levels for breakpoints and/or notes in a waveform representing a melody. These
methods include segmenting the waveform into a series of frames, wherein adjacent
breakpoints encompass one or more sequential frames, each note being defined between
adjacent breakpoints. Then, at least one of the following three steps may be executed: (a)
computing a spectral energy distribution (SED) indicator for each frame; (b) estimating the
pitch of each frame; and (c) determining the energy level of each frame. The confidence
levels may be based on any of the following three characteristics: (i) the SED indicator, (ii) 
changes in pitch, and (iii) the energy level. 

An entry may be retrieved from a music database of sequences of pitches and beat 
durations in accordance with a match function that receives the digitized melody obtained
from a melody source as described herein. A method and system for implementing the 
retrieval may determine a score for each entry based on a search which minimizes the cost of 
matching the pitches and beat durations of the melody and the entry, and which may be based 
on minimizing a cost computation which may take into account one or more note insertion 
and/or deletion errors and penalize the cost in accordance with confidence levels pertaining
thereto. 
Another aspect of the invention relates to a method and system of retrieving at least
one entry from a music database, wherein each entry is associated with a sequence of pitches 
and beat durations. The method includes receiving a digitized representation of an input 
melody; identifying breakpoints in the melody in order to define notes therein; associating 
each breakpoint and/or note with a confidence level; and determining a pitch and beat 
duration for each note of the melody. Then, a score is determined for each database entry 
based on a search which minimizes the cost of matching the pitches and beat durations of the
melody and the entry. The search considers at least one deletion or insertion error in a
selected note of the melody and, in this event, penalizes the cost of matching based on the
confidence level of the selected note or breakpoint associated therewith. At least one entry 
may then be presented to a user based on its score. 

BRIEF DESCRIPTION OF DRAWINGS 

The foregoing and other aspects of the invention will become more apparent from the
following description of preferred embodiments thereof and the accompanying drawings,
which illustrate, by way of example, the principles of the invention. In the drawings: 

Fig. 1 is a system block diagram showing the major components of a music
recognition system according to a preferred embodiment of the invention;

Fig. 2 is a functional block diagram showing the processing blocks of a melody-to-note
conversion subsystem employed in the music recognition system of Fig. 1;

Fig. 3 is a schematic diagram illustrating some of the processing activities of the
melody-to-note conversion subsystem with respect to a sample input melody;

Fig. 4A is a normalized energy spectrogram, plotted against time and frequency, of a
sample input melody (which sample differs from the melody referenced in Fig. 3);

Fig. 4B is a graph of the normalized energy spectrum at a first time frame in Fig. 4A
plotted against frequency; 

Fig. 4C is a graph of the normalized energy spectrum at a second time frame in Fig. 
4A plotted against frequency; 

Fig. 5A is identical to Fig. 4A (and provided on the same drawing sheet as Figs. 5B 
and 5C for reference purposes); 

Fig. 5B is a graph of a spectral energy distribution indicator, computed in a first 
manner, which is based upon the spectrogram of Fig. 5A;

Fig. 5C is a graph of a "minimum measure", as discussed in greater detail below,
which is based on the spectral energy distribution indicator shown in Fig. 5B;

Fig. 6A is identical to Fig. 4A (and provided on the same drawing sheet as Figs. 6B
and 6C for reference purposes);

Fig. 6B is a graph of a spectral energy distribution indicator, computed in a second
manner, which is based upon the spectrogram of Fig. 6A;

Fig. 6C is a graph of a "minimum measure", as discussed in greater detail below,
which is based on the spectral energy distribution indicator shown in Fig. 6B;

Fig. 7A is identical to Fig. 4A (and provided on the same drawing sheet as Figs. 7B
and 7C for reference purposes);

Fig. 7B is a graph of a spectral energy distribution indicator, computed in a third
manner, which is based upon the spectrogram of Fig. 7A;

Fig. 7C is a graph of a "minimum measure", as discussed in greater detail below,
which is based on the spectral energy distribution indicator shown in Fig. 7B;

Fig. 8A is identical to Fig. 4A (and provided on the same drawing sheet as Figs. 8B
and 8C for reference purposes);

Fig. 8B is a graph of a spectral energy distribution indicator, computed in a fourth
manner, which is based upon the spectrogram of Fig. 8A;

Fig. 8C is a graph of a "minimum measure", as discussed in greater detail below,
which is based on the spectral energy distribution indicator shown in Fig. 8B;

Fig. 9A is identical to Fig. 4A (and provided on the same drawing sheet as Figs. 9B
and 9C for reference purposes);

Fig. 9B is a graph of a spectral energy distribution indicator, computed in a fifth
manner, which is based upon the spectrogram of Fig. 9A;

Fig. 9C is a graph of a "minimum measure", as discussed in greater detail below,
which is based on the spectral energy distribution indicator shown in Fig. 9B; and

Fig. 10 is a schematic diagram illustrating a process for matching notes.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
1. System Overview 

Fig. 1 shows a music recognition system 10 which comprises four major components:
a melody-to-note conversion subsystem 12; a music reference database 14; a note-matching
engine 16; and an output subsystem 18.

The music recognition system 10 takes a digitized input melody 20 obtained from a
source 11 as a query. For reasons explained in greater detail below, it is preferred that the
input melody originate from a user in the form of humming, particularly through intonations
of notes that are combinations of a semi-vowel, such as "l", and vowel, such as "a" (i.e.,
notes in the form of "la"). However, the input melody may also comprise many other forms
of humming, singing, whistling or other such types of music-like vocalization. The input
melody may also originate from a musical instrument(s). In these cases the source 11
represents circuitry for recording and digitizing the user's voice or the musical instrument.
Alternatively, the input melody may originate from a recording of some kind, in which case
the source 11 represents the corresponding player and, if necessary, any circuitry for
digitizing the output of the player. The digitized input melody 20 is supplied to the
melody-to-note conversion subsystem 12.

The melody-to-note conversion subsystem 12 converts the digitized input melody 20
into a sequence of musical notes characterized by pitch, beat duration and confidence levels.
This is accomplished through spectral analysis techniques described in greater detail below
which are used to find "breakpoints" in the input melody in order to separate it into distinct
notes. The pitch of each note is determined by the periodicity of the input melody waveform
between the note-defining breakpoints. The beat duration of each note is extracted from the
separation of the notes, i.e., the duration is determined from the time period between
breakpoints. To compensate for error in the separation, each breakpoint is preferably
associated with a confidence level, which indicates how likely the breakpoint is a valid
breakpoint. A confidence level is preferably also associated with each note to indicate how
unlikely the identified note actually contains more than one note. The output of the
melody-to-note conversion subsystem 12 is a differential note and timing file 150 which comprises
the relative difference in pitch and the relative difference in beat duration of consecutive
notes. The difference is preferably expressed in terms of the logarithm of the ratio of the
pitch and duration values of the consecutive notes. The reason for using pitch and duration
differences is discussed further below.
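
The differential representation described above can be illustrated with a minimal sketch. The function name `differential_notes` and the use of the natural logarithm are assumptions made for illustration only; the text specifies only that the logarithm of the ratio of consecutive pitch and duration values is preferably used.

```python
import math


def differential_notes(pitches_hz, durations):
    """Return (log pitch ratio, log duration ratio) for each pair of consecutive notes.

    Hypothetical helper: the base of the logarithm is an assumption.
    """
    if len(pitches_hz) != len(durations):
        raise ValueError("pitches and durations must have the same length")
    diffs = []
    for i in range(1, len(pitches_hz)):
        pitch_diff = math.log(pitches_hz[i] / pitches_hz[i - 1])
        duration_diff = math.log(durations[i] / durations[i - 1])
        diffs.append((pitch_diff, duration_diff))
    return diffs
```

Under this sketch, a melody transposed to another key (all pitches multiplied by a constant) or performed at another tempo (all durations multiplied by a constant) yields the same differential sequence.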

The music reference database 14 stores the differential note and timing files of all 
music or songs searchable by the system 10. Each such file preferably comprises a short, 
easily recognizable segment of a song or music, i.e., the so-called "signature melody", but 
may alternatively encompass an entire song or piece of music. These files may be generated
in the first instance by the melody-to-note conversion subsystem 12.

The note matching engine 16 compares the differential note and timing file 150 from 
the melody-to-note conversion subsystem 12 with songs or pieces of music in the music 
reference database 14, which are stored in a similar file format. Since different users may 
vocalize or play a song or piece of music in a different key and at a different tempo, the system 10
does not compare the pitch of the uttered melody and the reference files directly, but rather
the ratio in pitch between consecutive notes. For the same melody, if the scale is shifted to a
different frequency, the ratio in the frequency (pitch) of the consecutive notes will be the
same. Similarly, to normalize for differences in tempo, the system 10 compares the relative
duration of the consecutive notes. The note matching engine 16 employs dynamic
programming techniques described in greater detail below for matching the differential note
and timing file 150 with similarly formatted files stored in the music database 14. These
techniques can compensate for pitch errors and insertions or deletions of notes by the user or
the melody-to-note conversion subsystem 12. The engine 16 calculates a matching score for 
each song in the database 14. 

The output subsystem 18 sorts the songs or music in the database 14 based on the
matching scores. The highest ranked song(s) or piece(s) of music is selected for
presentation to the user.
2. Melody to Note Conversion 

2.1. Overview 

Fig. 2 shows the functional blocks of the melody-to-note conversion subsystem 12.
The subsystem 12 generates the following data from the digitized input melody 20, which is
used to construct the output differential note and timing file 150:

• a list of breakpoints, which indicate the boundaries between distinct notes in the 
input melody; and

• a list of pitches, each pitch being associated with each note between two adjacent 
breakpoints. 

In addition, the subsystem 12 determines one or more confidence levels related to 
breakpoints and/or notes, and uses one or more of those confidence levels in the construction 
of the differential note and timing files 150. Specifically, in the preferred embodiment, a 
confidence measure or level is associated with each breakpoint that indicates the probability 
that the breakpoint is valid. A confidence measure or level may also be associated with each 
identified note, which indicates the likelihood that the identified note does not contain more 
than one note. 

Breakpoints are intended to indicate points of silence or points of inflection (i.e., 
alteration in pitch or tone of the voice) in the input melody. The embodiments described 
herein use more than one technique to identify a breakpoint and determine its confidence
level by considering how "closely" the various techniques have collectively identified a
breakpoint. For example, if all techniques have identified a breakpoint at the same particular
point in the input melody, the confidence level associated with that breakpoint is relatively
high. Conversely, if one or more of the techniques do not identify a breakpoint at or
near that particular point in the melody, the confidence level will be lower.

In the illustrated embodiment of Fig. 2, three tonal characteristics are considered in
identifying breakpoints:

• silence, or conversely regions of the input waveform containing humming
(represented by output arrow 60);

• changes in pitch (represented by output arrow 50); and

• changes or differences in the distribution of energy across the frequency spectrum
over time (represented by output arrow 90).

The first two characteristics should be intuitively understood for their value in
identifying a breakpoint. The last item is a breakpoint characteristic due to the typical nature
of human vocalization. More particularly, as mentioned, users can hum melodies using notes
which are a combination of a semi-vowel, such as "l", and a vowel, such as "a", i.e., "la".
When enunciating the semi-vowel, it has been found that the mouth is typically actuated in
such a way that results in the sound energy being concentrated at lower frequencies, as
compared with the frequency distribution of the sound energy during the vowel. The
preferred embodiment takes advantage of this observation, as discussed in greater detail
below.

Notes are defined between two adjacent breakpoints. The embodiments described
herein can use one or more than one technique to determine a confidence level associated
with each note, which indicates the likelihood that the note contains only one note from the
input melody. Because this is equivalent to the confidence that a breakpoint was not missed
inside the note, the note confidence measures can be derived from the same quantities as used
for breakpoint confidence measures, except with an inverse relationship. For example, a
large and rapid change in pitch near a breakpoint increases the confidence in that breakpoint.
However, large and rapid changes in pitch in the interval between two breakpoints decrease
the confidence that a breakpoint has not been missed. As with breakpoint confidence
measures, note confidence measures may be based on one or more different indicators.
2.2. Detailed Discussion

One set of processing steps of the subsystem 12 begins by filtering the input melody
20 (alternatively referred to as the "input waveform") with a bandpass filter 25 in order to
attenuate frequency components that lie outside the range of expected pitches.

Next, a framer 30 segments the filtered input waveform into a sequence of "frames"
of equal period, e.g., 1/32 of a second. Each frame contains a short portion of the total
filtered input waveform. Adjacent frames may contain overlapping parts of the filtered input
waveform to provide for some degree of continuity therebetween, as known in the art per se.
The overlap is preferably a tunable parameter and may be expressed as a percentage. Every 
part of the filtered input waveform is thus represented in at least one frame. 
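
The framing step can be sketched as follows. This is only an illustration under stated assumptions: the name `frame_signal`, the expression of overlap as a fraction between 0 and 1, and the zero-padding of the final frame are choices made here and are not specified in the text.

```python
def frame_signal(samples, frame_len, overlap=0.5):
    """Split samples into equal-length frames.

    `overlap` is the fraction of each frame shared with its neighbour
    (a tunable parameter). The last frame is zero-padded, which is an
    assumption of this sketch.
    """
    if frame_len <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("frame_len must be positive and 0 <= overlap < 1")
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start < len(samples):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad a short final frame
        frames.append(frame)
        if start + frame_len >= len(samples):
            break  # every sample is now covered by at least one frame
        start += hop
    return frames
```

For instance, at a hypothetical sampling rate of 8000 Hz, a frame period of 1/32 second would correspond to `frame_len = 250` samples.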

The auto-correlation of each frame is then computed at block 35. The auto-correlation
c[l] of a waveform x[n] is defined as the sequence $c[l] = \sum_{k=-\infty}^{\infty} x[k]\,x[l+k]$.
This provides a measure of the similarity of a signal with a shifted version of itself, where the
amount of shift is given by l. The auto-correlation is related to the spectral energy
distribution of x[n]. The auto-correlation computation will yield a multitude of
auto-correlation coefficients for each frame. As known in the art, peaks in the auto-correlation
provide an indication of the periodicity or pitch of a waveform, which in this case is the part
of the filtered input waveform contained in each frame.
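
A minimal sketch of the per-frame auto-correlation and of peak picking follows. It assumes a finite frame (samples outside the frame are treated as zero) and the function names `autocorrelation` and `peak_lags` are hypothetical.

```python
import math


def autocorrelation(frame):
    """c[l] = sum_k x[k] * x[k + l] for l = 0 .. len(frame) - 1.

    Samples outside the frame are treated as zero (an assumption).
    """
    n = len(frame)
    return [sum(frame[k] * frame[k + l] for k in range(n - l)) for l in range(n)]


def peak_lags(c, min_lag=1, top=3):
    """Return up to `top` lags that are local maxima of c, largest value first."""
    peaks = [l for l in range(max(min_lag, 1), len(c) - 1)
             if c[l] > c[l - 1] and c[l] >= c[l + 1]]
    return sorted(peaks, key=lambda l: c[l], reverse=True)[:top]
```

For a frame containing a sinusoid with a period of 20 samples, the largest non-zero-lag peak returned by this sketch falls at a lag of 20, i.e., at the pitch period.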

Block 45 provides a frame-by-frame pitch estimate 50. This is carried out by first
identifying the "largest" peaks in the auto-correlation of each frame, e.g., the top 2-10
auto-correlation values. This yields a number of "pitch period candidates". The estimated pitch
period of the frame is determined by selecting the pitch period candidate that corresponds to a
large auto-correlation peak while simultaneously considering how "close" the pitch period
candidate is to pitch period estimates in one or more adjacent frames. The adjacent frames
may be preceding or succeeding frames, or both. The preferred embodiment employs a cost
function which weights the size of the auto-correlation peaks and the closeness of the
corresponding pitch period candidates to pitch period estimates in adjacent frames. This
analysis presumes that the human vocal tract cannot radically alter pitch in the short time
period represented by a frame, e.g., 1/32 second. If no such pitch period can be found from
among the possible pitch period candidates, the pitch measurement block 45 labels that frame
as containing no pitch. In this manner the possibility that there is no reliable pitch in the
frame is also considered.
For example, let $p_i$ be the pitch period in frame $i$, where $p_i$ is either one of the
identified pitch period candidates or a value indicating the lack of any identified pitch. An
example cost function is $\sum_i \{ D(p_{i-1}, p_i) + (1 - c[p_i]) \}$, where the sum is taken over all frames $i$
in the input melody. The function $D(p_{i-1}, p_i)$ measures the difference between adjacent
pitch period estimates, for example $D(p_{i-1}, p_i) = \alpha\,|\ln(p_i) - \ln(p_{i-1})|$, where $\alpha$ is a
constant [value illegible]. If either $p_i$ or $p_{i-1}$ indicates that there is no identified pitch,
$D(p_{i-1}, p_i)$ is set equal to a constant, e.g. 4. The value $c[p_i]$ is the normalized
auto-correlation at the shift corresponding to $p_i$. If $p_i$
indicates that there is no identified pitch, then we assign $c[p_i] = 0$. The exact sequence of
pitch period candidates minimizing this cost function can be computed by a dynamic
programming procedure similar to that described in Section 3 on the note matching engine.
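
The example cost function lends itself to a standard dynamic programming (Viterbi-style) search. The following is only a sketch under stated assumptions: the names `track_pitch` and `transition_cost` are hypothetical, the default `alpha=1.0` is a placeholder rather than the value used in the source, and the first frame is assumed to incur no transition cost.

```python
import math

NO_PITCH = None  # marker for "no identified pitch" in a frame


def transition_cost(prev, cur, alpha, no_pitch_cost):
    """D(p_{i-1}, p_i): scaled log-period difference, or a constant if either is missing."""
    if prev is NO_PITCH or cur is NO_PITCH:
        return no_pitch_cost
    return alpha * abs(math.log(cur) - math.log(prev))


def track_pitch(candidates, alpha=1.0, no_pitch_cost=4.0):
    """candidates[i] is a list of (period, normalized_autocorr) pairs for frame i.

    Returns the period sequence (None = no pitch) minimizing
    sum_i { D(p_{i-1}, p_i) + (1 - c[p_i]) }.
    """
    if not candidates:
        return []
    # Every frame may also be labelled as having no pitch, with c = 0.
    states = [list(frame) + [(NO_PITCH, 0.0)] for frame in candidates]
    cost = [1.0 - c for _, c in states[0]]
    back = []
    for i in range(1, len(states)):
        new_cost, pointers = [], []
        for period, c in states[i]:
            totals = [cost[j] + transition_cost(prev, period, alpha, no_pitch_cost)
                      for j, (prev, _) in enumerate(states[i - 1])]
            best_j = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[best_j] + (1.0 - c))
            pointers.append(best_j)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the last frame.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [states[-1][j][0]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(states[i][j][0])
    path.reverse()
    return path
```

In this sketch the continuity term discourages isolated octave jumps: a frame whose strongest auto-correlation peak lies at half the period of its neighbours is still assigned the neighbouring period when that yields the lower total cost.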

Block 55 seeks to detect regions 60 of the input waveform containing useful sound
such as humming or music (as opposed to silence or noise), based on the frame-by-frame
pitch estimate 50 and the frame-based auto-correlation 40. The manner in which this is
preferably carried out is exemplified in Fig. 3. In Fig. 3, each position along the horizontal
axis represents a frame, with the "P" line 56 representing input pitch estimates 50 (Fig. 2) and
the "E" line 57 representing the energy of each frame, as determined from the frame-based
auto-correlation 40 (Fig. 2). In this example (Fig. 3), the pitch estimates and energy
estimates have quantized values ranging from 1-9. The sound detection block 55 first looks
for regions that may have useful sound because a valid pitch estimate was computed in the
block 45. This is shown in line "S1" 58 of Fig. 3 where the symbol 'H' represents useful
sound. Next, the sound detection block 55 considers the average energy of the frames in each
region. Where the average energy is below a specified threshold, the region is considered to
have no useful sound. This is shown in line "S2" 59 of Fig. 3. In the illustrated example, the
block 55 thus considers region 60B of the input waveform as being silent. Conversely,
regions 60A and 60C are considered to contain useful sound. Regions containing useful
sound are sent to a breakpoint detection block 100.
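
The two-stage sound detection can be sketched as below. The function name `detect_sound_regions`, the use of `None` for a frame without a valid pitch estimate, and the half-open `(start, end)` region convention are assumptions made for illustration.

```python
def detect_sound_regions(pitches, energies, energy_threshold):
    """Return (start, end) frame index pairs, end exclusive, of regions with useful sound.

    Stage 1: a region is a run of consecutive frames with a valid pitch estimate.
    Stage 2: a region is kept only if its average frame energy reaches the threshold.
    """
    regions = []
    start = None
    # A trailing None acts as a sentinel that closes a run ending at the last frame.
    for i, pitch in enumerate(list(pitches) + [None]):
        if pitch is not None and start is None:
            start = i
        elif pitch is None and start is not None:
            mean_energy = sum(energies[start:i]) / (i - start)
            if mean_energy >= energy_threshold:
                regions.append((start, i))
            start = None
    return regions
```

With quantized values such as those of Fig. 3, a run of pitched frames whose energies average below the threshold would be discarded in the same way that region 60B is treated as silent.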

The breakpoint detection block 100 (Fig. 2) also receives input from a parallel
processing path comprising a high-pass filter 65, a framer 70 and a spectral energy
distribution indicator computation block 75. The high-pass filter 65 filters the input
waveform 20 in order to emphasize high frequency information that has been found to be
useful in detecting the breakpoints between notes. The framer 70 slices the filtered input
waveform into frames, which are identical in scope and temporal position to the frames
generated by framer 30.

The spectral energy distribution ("SED") indicator computation block 75 computes a
numerical measure or SED indicator 90, which indicates how the sound energy is distributed
in each frame. The SED indicator preferably assumes relatively high values if the sound
energy is concentrated near high frequencies and relatively low values if the sound energy is
concentrated near other frequencies, as described in greater detail below. For example, a 4
kHz frequency range may be considered with high frequencies deemed to be those approaching
4 kHz and low frequencies deemed to be those near zero kHz.

The breakpoint detection block 100 finds initial estimates for the locations of note 
breakpoints (i.e., "candidate" breakpoints) 105 and computes a confidence measure 110
associated with each candidate breakpoint 105. This confidence measure varies between 0
and 1, where a value near 1 indicates that the breakpoint is very reliable.

The breakpoint detection block 100 operates on regions of the input waveform
supplied by the sound detection block 55. In Fig. 3, for example, these would be regions 60A
and 60C. The detection block 100 assigns breakpoints to the beginning and end frames of
each region. Thus, the transition between a frame with no pitch estimate and a frame with a
valid pitch estimate is one method that may be used to identify breakpoints. These
breakpoints are given a confidence level of 1. This is exemplified in Fig. 3 by the "x" symbol
in the "B" line 101.

Within each region, the block 100 detects candidate breakpoints based on minima
present in the SED indicator 90. These are exemplified in Fig. 3 by the "^" symbol in the
"B" line 101. The reason for this can be understood on an intuitive level by considering a
melody waveform that consists of a sequence of notes, each of which is sung as "la." The
vowel part "a" is typically longer in duration than the consonant part and is usually better
defined spectrally. Therefore, it should provide the most reliable information for pitch
extraction. Segmentation can be performed if the "l" part of each "la" can be detected.
Because "l" is a semivowel, it typically contains strong pitch periodicity. However, because
of the constriction of the mouth during production, it contains less overall energy and less
high frequency resonant structure.

This can also be seen through experimental observation. For example, Fig. 4A is a
spectrogram of a normalized energy spectrum for a sample melody hummed using "la" notes.
(Note that Fig. 4A relates to a sample melody that differs from that shown in Fig. 3.) More
particularly, the normalized energy spectrum is shown as a gray scale image wherein
normalized energy values approaching a maximum value are white and values near zero are
black. The vertical axis of the spectrogram corresponds to frequency and the horizontal axis
corresponds to time. Thus, a vertical cross-section of the spectrogram essentially corresponds
to one frame and represents the normalized energy spectrum of the frame as a function of
frequency. The energy spectrum of a frame is defined as the squared magnitude of the
Discrete Fourier Transform of each frame, and always assumes a non-negative value. The
normalized energy spectrum of a frame is obtained by normalizing the energy spectrum of the
frame by the total energy in the frame; i.e., the sum of the energy spectrum over all
frequencies in the frame.
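
The definition of the normalized energy spectrum can be sketched directly. A naive O(N^2) Discrete Fourier Transform is used here only to keep the example self-contained; the function names and the handling of an all-zero frame are assumptions of this sketch.

```python
import cmath
import math


def energy_spectrum(frame):
    """Squared magnitude of the DFT of the frame, one value per frequency bin."""
    n = len(frame)
    spectrum = []
    for k in range(n):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def normalized_energy_spectrum(frame):
    """Energy spectrum divided by the total energy over all frequency bins."""
    spectrum = energy_spectrum(frame)
    total = sum(spectrum)
    if total == 0.0:
        return spectrum  # assumption: a silent frame is left as all zeros
    return [e / total for e in spectrum]
```

A pure cosine at bin 2 of a 16-sample frame, for example, places half of the normalized energy in bin 2 and half in its mirror image, bin 14.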

A strong banding structure (i.e., generally horizontal white lines) exists between 
frame nos. 50 and 350. The rest is basically noise. The bands are harmonics (multiples) of the 
pitch frequency and move closer and farther apart as the pitch changes. The dominant band 
in each frame is not the pitch frequency, but some harmonic of it. Which harmonic is 
emphasized depends strongly upon the shape of the vocal tract and mouth at the time instant. 

There are about ten notes in Fig. 4A with the breakpoints being indicated by the 
vertical white lines 160 in the image. (Lines 160 are not part of the spectrogram but are 
merely used to indicate the position of the breakpoints in the image.) Breakpoints between 
notes can be seen where the dominant band shifts lower because constrictions in the vocal 
tract reduce the amount of high frequency energy uttered. This is shown more clearly in Figs. 
4B and 4C. Fig. 4B shows the normalized energy spectrum plotted versus frequency for 
frame no. 150, which is near a breakpoint. Fig. 4C shows the same kind of plot for frame no.
170, which is in the middle of a hummed note. A shift in the energy distribution to higher
frequencies is clearly evident. Note that these plots are essentially a cross-section through a
vertical slice of the spectrogram illustrated in Fig. 4A.

The SED indicator 90 represents the shift in energy distribution. This is a numerical
measure which combines the spectral energies in each frame in such a way that the value of
that measure is large if the energy distribution is concentrated in certain frequency bands and
small if the energy distribution is concentrated in others. There are a variety of ways to
compute the SED indicator.

In one implementation, the SED indicator 90 can be computed as the first moment of
the energy spectrum in each frame divided by the zeroth moment. More particularly, let X(k)
be the energy spectrum at frequency bin k; the corresponding spectral energy distribution
measure is given by

    \frac{\sum_k k \, X(k)}{\sum_k X(k)}

The summation is carried out over all frequency bins from 0
(DC) up to the frequency bin corresponding to the Nyquist frequency. Frequency bins past
the Nyquist frequency contain no additional information, since for a real-valued frame they
mirror the bins below it. This measure takes large
values if X(k) is concentrated around high frequencies (large k). The graph of Fig. 5B plots
the SED indicator (when computed as just described) for the sample input melody of Figure
4A, i.e., for all frames. In Fig. 5B, vertical lines 162 indicate the positions of breakpoints.
Fig. 5A repeats Fig. 4A to facilitate comparison. Note that the SED indicator drops to
minimum values at or near the breakpoints.
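
As a minimal sketch, and not necessarily the form used in any actual embodiment, the first-moment computation could be written in Python as below. The function name `sed_indicator` and the default choice of the Nyquist bin as half the spectrum length are assumptions of this sketch.

```python
def sed_indicator(energy_spectrum, nyquist_bin=None):
    """First moment of the energy spectrum divided by its zeroth moment.

    energy_spectrum: X(k) for one frame, indexed by frequency bin k.
    nyquist_bin: last bin to include; assumed here to be len // 2 by default.
    """
    if nyquist_bin is None:
        nyquist_bin = len(energy_spectrum) // 2
    bins = energy_spectrum[: nyquist_bin + 1]  # bins 0 (DC) .. Nyquist
    zeroth = sum(bins)
    if zeroth == 0.0:
        return 0.0
    first = sum(k * x for k, x in enumerate(bins))
    return first / zeroth
```

The result is the energy-weighted mean bin index, so it rises when energy moves toward high frequencies and falls when constrictions in the vocal tract remove high frequency energy.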

Based on the SED indicator 90, the breakpoint detection block 100 preferably derives
a "minimum measure" at each frame, which is positive if there is a local minimum in the
SED indicator "near" the corresponding frame, and zero otherwise. In this context the
number of "near" frames is, for example, 15 frames before and 15 frames after the present
frame. By considering or integrating such information over a number of frames the SED
indicator can be smoothed. The amplitude of the minimum measure is larger the "deeper" the
local minimum. A linear relationship is preferably employed for determining the amplitude
of the minimum measure but other types of relationships can be employed in the alternative
such as power, exponential, and logarithmic relationships. Fig. 5C shows an example of the
minimum measure for the SED indicator shown in Fig. 5B. In Fig. 5C, vertical lines 164
indicate the positions of breakpoints. It will be seen from Fig. 5C that the minimum measure
takes into consideration the relative depth of the local minima of the SED indicator in
comparison to the surrounding plateau, and also smoothes the SED indicator to eliminate the
many peaks and valleys after frame no. 350. The breakpoint detection block 100 uses the
minimum measure to determine candidate breakpoints by finding the locations of the positive
peaks therein.
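
The specification does not spell out how the minimum measure is computed, so the following Python sketch is only one possible reading. It assumes that a frame receives a positive value only when it is itself the minimum of its window of "near" frames, and that the depth is measured linearly against the window average (standing in for the surrounding plateau). Both assumptions, and the function names, are hypothetical.

```python
def minimum_measure(sed, near=15):
    """Linear-depth measure of local minima in the SED indicator (a sketch)."""
    n = len(sed)
    measure = [0.0] * n
    for i in range(n):
        lo, hi = max(0, i - near), min(n, i + near + 1)
        window = sed[lo:hi]
        if sed[i] == min(window):
            plateau = sum(window) / len(window)  # assumed stand-in for the plateau
            measure[i] = max(0.0, plateau - sed[i])
    return measure


def candidate_breakpoints(measure):
    """Frames at which the minimum measure has a positive peak."""
    return [
        i
        for i in range(1, len(measure) - 1)
        if measure[i] > 0 and measure[i] >= measure[i - 1] and measure[i] > measure[i + 1]
    ]
```

A power, exponential or logarithmic relationship could be substituted by transforming the depth value before it is stored.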

If desired, an additional or alternative method for identifying breakpoints is by
determining locations of rapid changes in the valid pitch estimate across frames. The rate of
change in pitch at a given frame can be determined from examination of the pitch changes in
surrounding frames. For example, if p_i is the pitch estimate at frame i, the rate of change
can be estimated as being proportional to a sum, running from k = -r to k = r, over the pitch
estimates of the surrounding frames, where the parameter r determines the size
of the neighborhood in the past and future frames used to determine the pitch change. An
example choice might be r = 3. Larger values are less influenced by noise or pitch
mis-estimations, but on the other hand will have less temporal resolution.
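
For illustration, one common way to estimate such a rate of change is the least-squares slope over the 2r+1 surrounding frames. The Python sketch below uses that estimator as an assumption; the summand actually intended by the specification may differ. Pitch is assumed to be expressed in semitones, so the result is in semitones per frame.

```python
def pitch_rate_of_change(pitch, i, r=3):
    """Least-squares slope of the pitch track around frame i (assumed estimator)."""
    num = 0.0
    den = 0.0
    for k in range(-r, r + 1):
        j = min(max(i + k, 0), len(pitch) - 1)  # clamp at the ends of the track
        num += k * pitch[j]
        den += k * k
    return num / den if den else 0.0
```

Multiplying the result by the frame rate converts it to semitones per second.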

The confidence measure 110 for each candidate breakpoint is preferably a weighted
sum of four numbers. The first number is large if the absolute value of the SED indicator is
"small" in the neighborhood of the breakpoint, e.g., less than about 75% of the average value
over the input waveform. The second number is large if the minimum measure in the vicinity
of the breakpoint is "large", e.g., larger than about 80% of the maximum value over the input
waveform. The third number is large if the rate of change of pitch at the breakpoint is
"large", e.g., more than about 10 semitones per second. The fourth number is large if the
average energy in frames around the breakpoint is "small", e.g., less than 50% of the
maximum value in some neighborhood around the candidate breakpoint. Preferably, each of
these numbers is weighted equally, although a variety of weightings may be used in the
alternative.

At block 115, only those breakpoint candidates 105 with confidence measures 110
exceeding a certain "threshold" are retained, e.g., 0.45. This yields the final note breakpoints
125, which delineate the notes and their beat durations, and final confidence measures 120.
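
The weighted sum and thresholding described above might look as follows in Python. This is a sketch under stated assumptions: each of the four numbers is taken to be a crisp 0 or 1, the weights are equal and sum to one, and the function and parameter names are hypothetical. The example thresholds (75%, 80%, 10 semitones per second, 50%, 0.45) are those given in the text.

```python
def breakpoint_confidence(sed, sed_avg, min_meas, min_meas_max,
                          pitch_rate, energy, energy_local_max,
                          weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four crisp indicators for one candidate breakpoint.

    pitch_rate is assumed to be in semitones per second.
    """
    indicators = (
        1.0 if abs(sed) < 0.75 * sed_avg else 0.0,
        1.0 if min_meas > 0.80 * min_meas_max else 0.0,
        1.0 if abs(pitch_rate) > 10.0 else 0.0,
        1.0 if energy < 0.50 * energy_local_max else 0.0,
    )
    return sum(w * x for w, x in zip(weights, indicators))


def retain_breakpoints(candidates, threshold=0.45):
    """Keep (frame, confidence) pairs whose confidence exceeds the threshold."""
    return [(frame, conf) for frame, conf in candidates if conf > threshold]
```

Under these assumptions a candidate must satisfy at least two of the four conditions to survive the 0.45 threshold.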

At block 115, a confidence measure 122 is also associated with each note identified
between two breakpoints 125. This confidence measure is designed to indicate the likelihood
that the identified note does not contain more than one note from the input melody, as would
happen if a breakpoint were missed in the breakpoint detection block 100. The note
confidence measure 122
is a weighted sum of four numbers. The first number is small if the variation of the SED
indicator for frames within the note is "large," e.g., the difference between the maximum and
minimum value is greater than some percentage (e.g. 20%) of the average value. The second
number is small if the maximum "minimum measure" taken over all frames in the note is
large, e.g. greater than 20% of the maximum value over the input waveform. The third is
small if the variation of the identified pitch periods for frames inside the note is large; e.g. the
maximum and minimum values vary by more than one semitone. The fourth is small if the
variation in the energy level for frames inside the note is "large"; e.g. the difference between
the maximum and minimum value is larger than some percentage (e.g. 20%) of the average
value. Note that the dependence of the note confidence measure on the SED indicator,
minimum measure and identified pitch is opposite that for the breakpoint confidence
measure. This is because the breakpoint confidence measure indicates the confidence that a
breakpoint was not mistakenly added. On the other hand, the note confidence measure
indicates the confidence that a breakpoint was not mistakenly deleted.
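
A corresponding sketch for the note confidence measure is given below. As before, the crisp 0/1 indicators, the equal weights, the expression of pitch in semitones, and all names are assumptions of the sketch rather than features taken from the specification.

```python
def note_confidence(sed, min_meas, min_meas_max, pitch_semitones, energy,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four crisp indicators over the frames of one note.

    Each list argument holds per-frame values for the frames inside the note.
    """
    def spread(values):
        return max(values) - min(values)

    def mean(values):
        return sum(values) / len(values)

    indicators = (
        0.0 if spread(sed) > 0.20 * mean(sed) else 1.0,
        0.0 if max(min_meas) > 0.20 * min_meas_max else 1.0,
        0.0 if spread(pitch_semitones) > 1.0 else 1.0,
        0.0 if spread(energy) > 0.20 * mean(energy) else 1.0,
    )
    return sum(w * x for w, x in zip(weights, indicators))
```

A steady note scores 1.0, and each sign of a hidden breakpoint inside the note lowers the score by one weight.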

At block 130, the pitch for each note is determined by merging the pitch periods
across the frames falling between two breakpoints delineating that note. It is preferred to
merge the pitch by finding the median of the pitch estimates between the two breakpoints.
The median computation is less sensitive than the average to occasional large errors in the
pitch estimates at individual frames. This yields the note pitch 135.
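
The median merge can be sketched in a few lines of Python. The representation of breakpoints as frame indices, with each note spanning from one breakpoint up to but not including the next, is an assumption of this sketch.

```python
from statistics import median


def note_pitches(frame_pitch, breakpoints):
    """Median pitch of the frames between each pair of adjacent breakpoints."""
    notes = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        frames = frame_pitch[start:end]
        if frames:
            notes.append(median(frames))
    return notes
```

For example, a note whose frames are estimated at 220, 221, 440, 219 and 220 Hz (one octave error) is assigned 220 Hz by the median, whereas the average would give 264 Hz.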

The differential note and timing file 150 is generated by block 140. The pitch ratio
and the beat duration ratio are expressed as the log of the ratio between two consecutive notes,
and are given as follows:

    • RP_i = \log(F_{i+1} / F_i), where F_i and F_{i+1} are the pitch frequencies of notes i and i+1,
      respectively.

    • RT_i = \log(T_{i+1} / T_i), where T_i and T_{i+1} are the beat durations of notes i and i+1,
      respectively.
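
As a hedged illustration, the differential file could be produced as follows. The base of the logarithm (2 here) and the direction of the ratio (later note over earlier note) are assumptions of this sketch, as is the function name.

```python
import math


def differential_notes(pitches_hz, durations, base=2.0):
    """Log pitch ratio and log beat duration ratio for consecutive notes."""
    out = []
    for i in range(len(pitches_hz) - 1):
        rp = math.log(pitches_hz[i + 1] / pitches_hz[i], base)
        rt = math.log(durations[i + 1] / durations[i], base)
        out.append((rp, rt))
    return out
```

Because only ratios are kept, humming the same melody in a different key or at a different tempo yields the same differential sequence.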

From the foregoing segmentation and pitch estimation, useful information has been
extracted about the input melody. However, this information may contain errors as the user
may vocalize or play some notes in an incorrect pitch or with incorrect beat duration. The
note-matching engine has the capability and flexibility to tolerate errors in both notes and
beats, as discussed next.
3. Note-matching engine 

The note-matching engine 16 (Fig. 1) is a score-based engine. It generates a score for
each song in the reference database 14 based on the similarity of the input melody to
the songs in the database, taking into account the confidence levels of each identified
breakpoint and each extracted note. By using dynamic programming the engine 16 attempts
to compensate for errors generated either by the user, who may have vocalized or played the
melody with wrong notes or wrong beats, or by the melody-note conversion subsystem 12,
which may miss some notes, over-count notes or measure the note duration incorrectly.

Instead of using absolute beat and note information, the preferred embodiment uses
relative beat and note information for the matching process because the user may vocalize or
play the melody in any scale, and not necessarily the 12-tone octave scale. Similarly, the user
may vocalize or play the melody in any tempo. Therefore, relative pitch and beat data is
preferred.

The inputs to the matching engine 16 are the differential note and timing file 150 and
candidate differential note and timing files from the music database 14. To compensate for
the insertion and deletion errors caused by the user or the conversion subsystem 12, it is
desirable to find the likelihood of matching instead of an exact match between two files. This
problem is similar to the classical longest common subsequence problem in which two strings
are given and a maximum length common subsequence of these two strings is found. The
note-matching engine 16 employs a dynamic programming approach described below to
solve this problem in an optimal manner.

The engine 16 sets up a 2-dimensional matrix 180 for each song matching, as
exemplified in Fig. 10. The Y-axis of the matrix 180 represents a string Y = (Y_1, Y_2, ..., Y_m)
from the differential note and timing file of the candidate song where each entry Y_j is a tuple
or vector (YRP_j, YRT_j). YRP represents the pitch ratio and YRT represents the beat duration
ratio of the corresponding entry. The X-axis of the matrix 180 represents a string X =
(X_1, X_2, ..., X_n) from the differential note and timing file 150 generated by the note conversion
subsystem 12. Each entry X_i is a 4-tuple or vector (XRP_i, XRT_i, XICON_i, XDCON_i) where
XRP_i, XRT_i, XICON_i and XDCON_i represent the pitch ratio, the beat duration ratio, the
confidence level of the note breakpoint, and the confidence level of the note preceding the
breakpoint, respectively.

The cost of a matching between an entry in X and an entry in Y is defined as the
weighted sum of the absolute differences between the corresponding RP and RT values. For
example, the cost of matching Y_j and X_i is equal to

    • match_cost(X_i, Y_j) = \alpha |YRP_j - XRP_i| + \beta |YRT_j - XRT_i|

where \alpha and \beta are the relative weights of pitch and beat duration ratios,
respectively. The cost reflects the error of matching X_i with Y_j. If an entry X_i in X is
perfectly matched with an entry Y_j in Y, the cost of match is equal to zero. The objective of
the song-matching algorithm is to find the subsequence of Y with the minimum matching
cost with X. The score of matching is thus the cost of matching. The lower the score, the
better the match. If there is no insertion or deletion error in the input differential note and
timing file 150, then the cost of matching the string (X_1, X_2, ..., X_n) with a sub-string
(Y_j, ..., Y_{j+n-1}) in Y is given by the following recursive formula:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-1}), (Y_j, ..., Y_{j+n-2})).

In practice, the j index may range from 1 to m-n+1 (where m is the total number of
notes in Y). The lowest value of min_match_cost() (as j ranges from 1 to m-n+1) is selected as
the score for the candidate song.
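
For the error-free case, the recursion above simply accumulates match_cost along a diagonal, so the score can be sketched as a sliding comparison. In the Python sketch below, entries are reduced to (RP, RT) pairs, indices are zero-based, and the function names and default weights are assumptions.

```python
def match_cost(x, y, alpha=1.0, beta=1.0):
    """Weighted absolute difference of pitch ratio and beat duration ratio."""
    return alpha * abs(y[0] - x[0]) + beta * abs(y[1] - x[1])


def best_exact_alignment(X, Y, alpha=1.0, beta=1.0):
    """Lowest cost of matching X against any equal-length sub-string of Y.

    Returns (cost, j) for the best starting index j, or None if X is longer than Y.
    """
    n, m = len(X), len(Y)
    best = None
    for j in range(m - n + 1):
        cost = sum(match_cost(X[i], Y[j + i], alpha, beta) for i in range(n))
        if best is None or cost < best[0]:
            best = (cost, j)
    return best
```

A perfect match anywhere in the candidate song gives a score of zero.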

However, insertion or deletion errors may happen. To compensate for this the engine
16 allows for matching with note insertions or note deletions. If there is an insertion before
note X_n, the cost of matching is given by:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-2}), (Y_j, ..., Y_{j+n-2}))

For k number of insertions, the cost of matching is given by:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-k-1}), (Y_j, ..., Y_{j+n-2}))

If there is a deletion before the note X_n, the cost of matching is given by:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-1}), (Y_j, ..., Y_{j+n-3}))

For k number of deletions, the cost of matching is given by:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-1}), (Y_j, ..., Y_{j+n-k-2})).

Insertion and deletion are not the norm. So, the matching process, although it allows
for insertion and deletion, also adds a penalty term when the engine 16 tries to match notes
assuming there are k insertions or deletions. However, the conversion subsystem 12 provides
a confidence level for every breakpoint and every note that indicates how likely the
breakpoint is a "correct" breakpoint and how likely the note is a "correct" note. A low
breakpoint confidence level means that the transition is likely to be a wrong transition and
hence may result in an insertion error. So a low breakpoint confidence level also implies the
note is likely to be an insertion. A low note confidence level means that the note is likely to
be composed from several notes whose breakpoints were mistakenly deleted. Therefore, when a
low note confidence level is encountered, there is a higher chance that a deletion error
occurred. (In other words, the breakpoint confidence level reflects note insertion error and
the note confidence level reflects note deletion error.) Hence, if a low-confidence note is
matched assuming there is an insertion or deletion error, the penalty should be lowered. For this
reason the engine 16 adjusts the penalty by weighting it with the breakpoint and note
confidence levels. A breakpoint that is associated with a low confidence level is more likely
to be an insertion and hence incurs a lower penalty during matching for note insertion. A
note that is associated with a low confidence level is more likely to be a deletion and hence
incurs a lower penalty during matching for note deletion. The above min_match_cost
calculations are updated as follows:

For k insertions:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-k-1}), (Y_j, ..., Y_{j+n-2})) +
      \sum_{i=1}^{k} (penalty of the i-th insertion) * (the corresponding i-th breakpoint's confidence level)

For k deletions:

    • min_match_cost((X_1, X_2, ..., X_n), (Y_j, ..., Y_{j+n-1})) = match_cost(X_n, Y_{j+n-1}) +
      min_match_cost((X_1, X_2, ..., X_{n-1}), (Y_j, ..., Y_{j+n-k-2})) + (penalty of k deletions) *
      (the corresponding note's confidence level)

Based on the recursive structure of the minimum matching cost calculation, a
dynamic programming approach is used to implement the note-matching algorithm. Fig. 10
illustrates the above cost calculation. This figure shows the first 4x4 matrix for a song in a
database being compared against a four-note hummed melody. The note matching engine 16
operates in a reverse direction, i.e., the last note of the hummed melody is considered first
against the latest notes of the song. For each matrix point, the engine 16 seeks a preceding
note having the lowest cost, which translates into highest similarity in relative pitch, relative
beat duration and confidence level. At matrix point (4,4) the engine 16 considers the
following possibilities:

Direction   Matrix Point   Meaning                                  Cost Calculation

(-1,-1)     (3,3)          Notes and beats are normal sequence,     \alpha|[...] - 64.3| + \beta|0.75 - 0.80|
                           i.e., no insertions or deletions

(-1,-2)     (3,2)          Note is missing in hummed melody         \alpha|62.0 - 64.3| + \beta|0.5 - 0.80| +
                                                                    (cost of 1 deletion)*0.7

[...]       [...]          [...]                                    \alpha|60 - 64.3| + \beta|0.5 - 0.8| +
                                                                    (cost of 2 deletions)*0.7

(-2,-1)     (2,3)          Extra note is added/inserted in          \alpha|[...] - [...]| + \beta|0.75 - 0.5| +
                           hummed melody                            (cost of 1 insertion)*0.8

(-3,-1)     (1,3)          Two extra notes added/inserted in        \alpha|[...] - 60.2| + \beta|0.75 - 0.52| +
                           hummed melody                            (cost of 1st insertion)*0.8 +
                                                                    (cost of 2nd insertion)*0.6

It will thus be seen from the foregoing that at a matching point (X_i, Y_j) in the matrix
formed by X and Y, the engine 16 searches for a preceding set of notes
{(X_{i-1-k}, Y_{j-1}), (X_{i-1}, Y_{j-1-k})}, 0 \le k \le max_k, which minimize a match cost defined as follows:

if k = 0,

    \alpha |YRP_{j-1} - XRP_{i-1}| + \beta |YRT_{j-1} - XRT_{i-1}|,

else if k > 0,

    \alpha |YRP_{j-1} - XRP_{i-1-k}| + \beta |YRT_{j-1} - XRT_{i-1-k}| +
        \sum_{m=0}^{k-1} (penalty for the (m+1)-th insertion) * XICON_{i-1-m}

    \alpha |YRP_{j-1-k} - XRP_{i-1}| + \beta |YRT_{j-1-k} - XRT_{i-1}| +
        (penalty for k deletions) * XDCON_{i-1}

where \alpha and \beta are weights.
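
To make the dynamic programming concrete, the following Python sketch fills the matrix in the forward direction, which is a simplification of this sketch (the text describes the engine as operating in the reverse direction). Several details are assumptions rather than features of the specification: the penalty for k deletions is taken as k times a constant, the deletion penalty is weighted by the note confidence of the hummed entry being matched, each insertion penalty is weighted by the breakpoint confidence of the skipped hummed entry, and the match may start at any entry of Y. Entries of X are (RP, RT, ICON, DCON) tuples and entries of Y are (RP, RT) tuples; all names are hypothetical.

```python
import math


def match_cost(x, y, alpha=1.0, beta=1.0):
    """Weighted absolute difference of pitch ratio and beat duration ratio."""
    return alpha * abs(y[0] - x[0]) + beta * abs(y[1] - x[1])


def song_score(X, Y, max_k=2, ins_penalty=1.0, del_penalty=1.0,
               alpha=1.0, beta=1.0):
    """Lowest matching cost of hummed string X against candidate string Y."""
    n, m = len(X), len(Y)
    if n == 0 or m == 0:
        return math.inf
    # D[i][j]: best cost of an alignment that ends with X[i] matched to Y[j].
    D = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0:
                best_prev = 0.0  # the hummed melody may start anywhere in Y
            elif j == 0:
                continue  # no earlier candidate note is available
            else:
                best_prev = D[i - 1][j - 1]  # k = 0: normal sequence
                for k in range(1, max_k + 1):
                    if i - 1 - k >= 0:
                        # k extra hummed entries X[i-1] .. X[i-k] are skipped.
                        pen = sum(ins_penalty * X[i - 1 - t][2] for t in range(k))
                        best_prev = min(best_prev, D[i - 1 - k][j - 1] + pen)
                    if j - 1 - k >= 0:
                        # k candidate entries are skipped (missed by the hummer).
                        pen = del_penalty * k * X[i][3]
                        best_prev = min(best_prev, D[i - 1][j - 1 - k] + pen)
            D[i][j] = match_cost(X[i], Y[j], alpha, beta) + best_prev
    return min(D[n - 1])
```

Running `song_score` once per candidate song and sorting by the returned value gives the ranked list, with lower scores indicating better matches.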
4. Utility

The melody retrieval system 10 can be used in, but is not limited to, the following
applications:

• Intelligent user interface of a music jukebox. Thousands of songs are typically
stored inside a jukebox. To select a song from this large database is
sometimes not an easy task using the traditional input method. Using the melody
retrieval system 10, the user can hum a few notes and the system will search
through all the songs stored in the jukebox and then output the songs that most
closely match the hummed melody. The user can then pick the song he or she
wants. This system can be extended beyond the jukebox application to many
consumer audio entertainment products that store songs, such as portable music
players like MD™ player, Walkman™, Discman™, MP3 portable player, and
others.

• A tool to search for a song or music piece on the Internet. The music retrieval
system 10 can be used as an Internet music-searching engine similar to a
conventional text-based web page searching engine. The user hums the melody
and the tool can initiate a search in an online music database. Such a tool can also
preferably spawn multiple searches in multiple databases. The results of the
parallel search can be consolidated, sorted and output according to the matching
scores. The output may also be a hypertext link such that the user can directly
select the song he or she wants and connect to the web site that stores the song for
purchasing or downloading.

• A tool to help cellular phone users to download songs from cellular phone or
wireless content providers. The next generation mobile phones (e.g. 3G cellular
phone) not only support high bit rate transmission but also have the local digital
signal processing power to decode digitally-compressed audio formats such as
MP3. However it may be difficult for users to select songs from the mobile
phone due to the small numerical keypad interface and small LCD screen. The
music retrieval system 10 can be employed to enable the user to hum a melody
into the mobile phone which can then transmit the input melody back to the base
station where the system 10 is preferably located. Once the note-matching engine
16 finishes the matching process the output subsystem 18 can transmit a list of the
top-ranked songs back to the user. The list can be displayed on the screen of the
mobile phone for the user to select the song to download.

• Password protection. Rather than having a text-based password protection
mechanism to access user accounts and the like, a query based on humming can
be employed in the alternative.
5. Variants 

One aspect of the invention is concerned with estimating or determining breakpoints
based on changes in the spectral energy distribution of the input melody. One
implementation of the SED indicator has been described. There are alternative ways of
computing the SED indicator which nevertheless yield similar properties to the
above-described implementation. One broad class of SED indicators is defined by

    \frac{\sum_k f(k) \, g(X(k))}{\sum_k g(X(k))}

where f(k) and g(X(k)) are non-negative and non-decreasing functions of k and X(k),
respectively. The previously described implementation of the SED indicator used f(k) = k
and g(X(k)) = X(k). However, other choices can produce similar results. For example:
• Figs. 6A - 6C show plots similar to Figs. 5A - 5C where the SED indicator is
  defined according to [...], i.e., f(k) = [...] and g(X(k)) = X(k).

• Figs. 7A - 7C show plots similar to Figs. 5A - 5C where the SED indicator is
  defined according to [...], i.e., f(k) = [...] and g(X(k)) = X(k).

• Figs. 8A - 8C show plots similar to Figs. 5A - 5C where the SED indicator is
  defined according to \frac{\sum_k \sin(\pi k / 2K) \, X(k)}{\sum_k X(k)}, where K is the
  frequency bin corresponding to the Nyquist frequency, i.e., f(k) = \sin(\pi k / 2K) and
  g(X(k)) = X(k).

• Figs. 9A - 9C show plots similar to Figs. 5A - 5C where the SED indicator is
  defined according to [...], i.e., f(k) = k and g(X(k)) = X(k)^[...].
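
The class of indicators discussed above, in which g(X(k)) is weighted by f(k), summed over the bins, and divided by the sum of g(X(k)), lends itself to a single parameterized routine. The Python sketch below is illustrative only: the function name is hypothetical, the spectrum is assumed to be already truncated to bins 0 through K, and the squared choice of g in the usage lines is an arbitrary example rather than one of the variants of the figures.

```python
import math


def generalized_sed(energy_spectrum, f, g):
    """SED indicator for weighting function f(k) and spectrum transform g(X(k))."""
    num = sum(f(k) * g(x) for k, x in enumerate(energy_spectrum))
    den = sum(g(x) for x in energy_spectrum)
    return num / den if den else 0.0


# Example choices of f and g (bins 0..K with K = 4).
K = 4
spectrum = [0.0, 1.0, 0.0, 3.0, 0.0]
first_moment = generalized_sed(spectrum, lambda k: k, lambda x: x)
sine_weighted = generalized_sed(
    spectrum, lambda k: math.sin(math.pi * k / (2 * K)), lambda x: x
)
squared = generalized_sed(spectrum, lambda k: k, lambda x: x ** 2)
```

Any non-negative, non-decreasing f and g can be passed in the same way, which makes it easy to compare how sharply different variants dip at the breakpoints.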

In the preferred embodiment, the SED indicator is defined so that it achieves large
values if the energy spectrum is concentrated in high frequencies and small values if the
energy spectrum is concentrated at low frequencies. An inverse relationship may be
employed. Also, alternative embodiments may choose different frequency ranges, such as
achieving large values within a band or bands of frequencies and low values outside that band
or bands. This might be done to differentiate other types of breakpoints.

It should also be appreciated that the SED indicator need not be computed from the
energy spectrum. For example, the SED indicator illustrated in Fig. 5B could be computed
by estimating the slope at the origin of the auto-correlation of each frame and normalizing
that slope by the value at the origin. This is due to the fact that the auto-correlation and the
energy spectrum are Fourier Transform pairs, and thus contain the same information.

Many examples have been given for the various parameters used in the spectral
analysis techniques discussed herein. This is done for the purpose of illustration only and is not
intended to be limiting. Without limiting the generality of the foregoing, these examples
include: the "largest" auto-correlation peaks; the "closeness" of pitch candidates to other
pitch estimates; the "nearness" of frames to a local minimum in the SED indicator; the
"depth" of the local minimum; the "smallness" of the absolute values of the SED indicators;
the "largeness" of the minimum measure in the vicinity of a breakpoint; the "largeness" of
the rate of change of pitch; and the "threshold" for the confidence measure. As will be
appreciated by those skilled in this art, a wide range of crisp values can be used to implement
what are essentially fuzzy logic concepts.

The preferred embodiment has been presented in a system block diagram format, but
in practice the invention may be implemented in software or hardware, or a combination of
both. Similarly, those skilled in the art will understand that numerous other modifications
and variations may be made to the embodiments disclosed herein without departing from the
spirit or scope of the invention.

CLAIMS



We claim: 



1. A method for converting a digitized melody into a sequence of notes,
comprising:
    segmenting said melody into a series of frames;
    computing a spectral energy distribution (SED) indicator for each frame; and
    estimating initial breakpoints in said melody based on said SED indicator, said
    notes being defined between adjacent initial breakpoints.

2. A method according to claim 1, wherein the value of said SED
indicator for a given frame is relatively large if an energy distribution associated with said
frame is concentrated in one or more specified frequency bands.

3. A method according to claim 2, including filtering said melody with a
high pass filter prior to segmenting said melody into said frames.

4. A method according to claim 3, wherein said energy distribution is
determined from a normalized energy spectrum of said frame.

5. A method according to claim 3, wherein said specified frequency band
is the upper portion of a 0 to 4 kHz range.

6. A method according to claim 3, wherein the SED indicator is defined
as \frac{\sum_k f(k) \, g(X(k))}{\sum_k g(X(k))}, where X(k) is the energy spectrum of a frame
at frequency bin k and f(k) and g(X(k)) are non-negative and non-decreasing functions of k
and X(k), respectively.

7. A method according to claim 6, wherein the SED indicator is defined
as [...].

8. A method according to claim 6, wherein the SED indicator is defined
as [...].

9. A method according to claim 6, wherein the SED indicator is defined
as [...].

10. A method according to claim 6, wherein the SED indicator is defined
as \frac{\sum_k \sin(\pi k / 2K) \, X(k)}{\sum_k X(k)}, where K is the frequency bin
corresponding to the Nyquist frequency.

11. A method according to claim 6, wherein the SED indicator is defined
as [...].



12. A method according to claim 3, wherein the auto-correlation of each
said frame is computed and said SED indicator is computed by estimating the slope at the
origin of the frame's auto-correlation and normalizing that slope by the value at the origin.

13. A method according to claim 1, including estimating the pitch of each
said frame.

14. A method according to claim 13, wherein estimating the pitch of each
frame comprises:
    computing the auto-correlation of each said frame; and
    estimating the pitch of each said frame by selecting a pitch period
    corresponding to a shift where the auto-correlation coefficient associated with the frame is
    relatively large.

15. A method according to claim 1, including estimating the pitch of each
said note between adjacent initial breakpoints.

16. A method according to claim 15, wherein estimating the pitch of each
note between initial breakpoints comprises:
    computing the auto-correlation of each said frame;
    estimating the pitch of each said frame by selecting a pitch period
    corresponding to a shift where the auto-correlation coefficient associated with the frame is
    relatively large; and
    averaging or taking the median of the pitch estimates of frames between
    adjacent breakpoints.

17. A method according to claim 15, including associating each said initial
breakpoint with a confidence level, which is influenced by at least one of (a) the degree in the
change or rate of change of pitch in the frames around the initial breakpoints, and (b) the
value of said SED indicator in the vicinity of the initial breakpoint.

18. A method according to claim 17, wherein the confidence level is
further influenced by the energy level of said melody in the vicinity of the initial breakpoint.

19. A method according to claim 17, including eliminating from
consideration initial breakpoints associated with confidence levels below a specified
threshold, thereby identifying breakpoints in said melody.

20. A method according to claim 19, including estimating the pitch and
beat duration of each said note between said breakpoints.

21. A method according to claim 1, wherein the melody is a voice-hummed
melody composed of a series of uttered semi-vowels.

22. Apparatus for converting a digitized melody into a sequence of notes,
comprising:
    means for segmenting said melody into a series of frames;
    means for computing a spectral energy distribution (SED) indicator for each
    frame; and
    means for estimating initial breakpoints in said melody based on said SED,
    said notes being defined between adjacent initial breakpoints.

23. Apparatus according to claim 22, wherein the value of said SED
indicator for a given frame is relatively large if an energy distribution associated with said
frame is concentrated in one or more specified frequency bands.

24. Apparatus according to claim 23, including filtering said melody with
a high pass filter prior to segmenting said melody into said frames.

25. Apparatus according to claim 24, wherein said energy distribution is
determined from a normalized energy spectrum of said frame.

26. Apparatus according to claim 24, wherein said specified frequency
band is the upper portion of a 0 to 4 kHz range.

27. A method for converting a digitized melody into a sequence of notes,
comprising:
    segmenting said melody into a series of frames;
    computing the auto-correlation of each said frame;
    estimating the pitch of each said frame based on (i) a pitch period
    corresponding to a shift where the auto-correlation coefficient associated with the frame is
    relatively large and (ii) the closeness of the pitch estimate to estimates in one or more
    adjacent frames; and
    estimating breakpoints in said melody based on changes in said pitch
    estimates, said notes being defined between adjacent breakpoints.

28. A method according to claim 27, wherein said breakpoints are
estimated based on a rate of change of said pitch estimates.

29. A method according to claim 27, including filtering said melody with a
band pass filter prior to segmenting the melody into frames.

30. A method according to claim 27, including estimating the pitch of each
note by selecting the average or median pitch of the frames falling within a pair of
breakpoints.

31. A method according to claim 27, wherein the melody is a voice-
hummed melody.

32. Another aspect of the invention provides a method for identifying
breakpoints in a digitized melody, the method comprising:
    segmenting the melody into a series of frames;
    computing the auto-correlation of each frame;
    estimating the pitch of each frame based on (i) a pitch period corresponding to
    a shift where the auto-correlation coefficient associated with the frame is relatively large and
    (ii) the closeness of the pitch estimate to estimates in one or more adjacent frames;
    determining regions of said melody where pitch estimates are likely to be
    invalid; and
    identifying said breakpoints in the melody based on transitions between
    frames having valid pitch estimates and transitions having invalid pitch estimates.

33. A method according to claim 32, wherein said breakpoints are
estimated based on a rate of change of said pitch estimates.

34. A method according to claim 32, including filtering said melody with a
band pass filter prior to segmenting the melody into frames.

35. A method according to claim 32, including estimating the pitch of each
note by selecting the average or median pitch of the frames falling within a pair of
breakpoints.

36. A method according to claim 32, wherein the melody is a voice-
hummed melody.

37. Apparatus for converting a digitized melody into a sequence of notes,
comprising:
    means for segmenting said melody into a series of frames;
    means for computing the auto-correlation of each said frame;
    means for estimating the pitch of each said frame based on (i) a pitch period
    corresponding to a shift where the auto-correlation coefficient associated with the frame is
    relatively large and (ii) the closeness of the pitch estimate to estimates in one or more
    adjacent frames;
    means for determining regions of said melody where pitch estimates are likely
    to be invalid; and
    means for estimating breakpoints in said melody based on changes in said
    pitch estimates or transitions between frames having valid pitch estimates and frames having
    no pitch estimates, said notes being defined between adjacent breakpoints.

38. A method of retrieving at least one entry from a music database,
wherein each said entry is associated with a sequence of pitches and beat durations, said
method comprising:

receiving a digitized representation of an input melody;

identifying breakpoints in said melody in order to define notes therein, each
said notes being delineated by adjacent breakpoints;

assigning a confidence level to each note or each breakpoint;

determining a pitch and beat duration for each note of said melody;

determining a score for each said entry based on a search which minimizes the
cost of matching the pitches and beat durations of said melody and said entry, wherein said
search considers at least one deletion or insertion error in a selected note of said melody and,
in this event, penalizes the cost of matching based on the confidence level of the selected
note or a breakpoint associated therewith; and

presenting said at least one entry to a user based on its score.

39. A method according to claim 38, wherein said pitches and beat 
durations are relative pitches and relative beat durations. 

40. A method according to claim 38, wherein the cost of matching a given
note $X_i$ of said melody with a given note $Y_j$ associated with said entry is:

$$\mathrm{match\_cost}(X_i, Y_j) = \alpha\,|YRF_j - XRF_i| + \beta\,|YRT_j - XRT_i|,$$

where $YRF_j$ and $YRT_j$ respectively represent the relative pitch and relative beat
duration of the note associated with said entry; $XRF_i$ and $XRT_i$ respectively represent
the relative pitch and relative beat duration of the note associated with said melody; and
$\alpha$ and $\beta$ are weights.
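
The match cost of claim 40 is a weighted sum of two absolute differences, which a short Python sketch makes concrete. The `Note` type, its field names and the default weights of 1.0 are assumptions for illustration only.

```python
from typing import NamedTuple

class Note(NamedTuple):
    rel_pitch: float     # relative pitch of the note
    rel_duration: float  # relative beat duration of the note

def match_cost(x: Note, y: Note, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted distance between a melody note x and a database-entry note y."""
    return (alpha * abs(y.rel_pitch - x.rel_pitch)
            + beta * abs(y.rel_duration - x.rel_duration))

# A melody note two units sharp and half a beat short, with beta = 2.
example_cost = match_cost(Note(2.0, 1.0), Note(0.0, 1.5), alpha=1.0, beta=2.0)
```

Raising `alpha` relative to `beta` makes the match more tolerant of rhythm errors than of pitch errors, and vice versa.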

41. A method according to claim 38, wherein:

a confidence level is assigned to each note and each breakpoint; and

said search considers deletion and insertion errors for any given note of said
melody and, in this event, penalizes the cost of matching based on the confidence level of the
given note and the confidence level of a breakpoint associated with the given note.

42. A method according to claim 41, wherein:

X is a sequence of notes, $X_i$, of said melody, each having components
$XRF_i$, $XRT_i$, $XICON_i$, and $XDCON_i$ which respectively represent the relative pitch,
relative beat duration, confidence level of the breakpoint and confidence level of the note
associated with said melody;

Y is a sequence of notes, $Y_j$, of said entry, each $Y_j$ having components $YRF_j$
and $YRT_j$ which respectively represent the relative pitch and relative beat duration of the
note associated with said entry;

X and Y form a matrix, and at a matching point $(X_i, Y_j)$ said search seeks to
identify a preceding set of notes $\{(X_{i-1-k}, Y_{j-1}),\ (X_{i-1}, Y_{j-1-k})\}$,
$0 \le k \le \max_k$, which minimize a match cost defined as follows:

if $k = 0$,

$$\alpha\,|YRF_{j-1} - XRF_{i-1}| + \beta\,|YRT_{j-1} - XRT_{i-1}|,$$

else if $k > 0$,

$$\alpha\,|YRF_{j-1} - XRF_{i-1-k}| + \beta\,|YRT_{j-1} - XRT_{i-1-k}| + \sum_{m=0}^{k-1} (\text{penalty for the } (m+1)^{\text{th}} \text{ insertion}) \cdot XICON_{i-1-m}$$

$$\alpha\,|YRF_{j-1-k} - XRF_{i-1}| + \beta\,|YRT_{j-1-k} - XRT_{i-1}| + (\text{penalty for } k \text{ deletions}) \cdot XDCON_{i-1}$$

where $\alpha$ and $\beta$ are weights.
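
The search described in claim 42 can be approximated by a small dynamic-programming sketch in Python. This is a simplified reading, not the claimed procedure itself: the alignment is anchored at the first and last notes of both sequences, each skipped melody note adds `ins_pen` times its breakpoint confidence, and skipping `k` entry notes adds `del_pen * k` times the note confidence of the preceding melody note. The type names, field names, penalty constants and `max_k` default are all assumptions for illustration.

```python
from typing import NamedTuple, Sequence

class MelodyNote(NamedTuple):
    rp: float    # relative pitch
    rt: float    # relative beat duration
    icon: float  # breakpoint confidence, scales insertion penalties
    dcon: float  # note confidence, scales deletion penalties

class EntryNote(NamedTuple):
    rp: float
    rt: float

def alignment_cost(X: Sequence[MelodyNote], Y: Sequence[EntryNote],
                   alpha=1.0, beta=1.0, ins_pen=1.0, del_pen=1.0, max_k=2):
    """Cheapest alignment of melody X with entry Y, matching X[0]-Y[0] and X[-1]-Y[-1]."""
    inf = float("inf")
    if not X or not Y:
        return inf
    n, m = len(X), len(Y)
    # D[i][j]: cheapest cost of an alignment that ends by matching X[i] to Y[j].
    D = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = alpha * abs(Y[j].rp - X[i].rp) + beta * abs(Y[j].rt - X[i].rt)
            if i == 0 and j == 0:
                D[0][0] = local
                continue
            best = inf
            for k in range(max_k + 1):
                # Predecessor (i-1-k, j-1): k melody notes treated as insertions.
                if i - 1 - k >= 0 and j - 1 >= 0:
                    pen = sum(ins_pen * X[i - 1 - q].icon for q in range(k))
                    best = min(best, D[i - 1 - k][j - 1] + pen)
                # Predecessor (i-1, j-1-k): k entry notes treated as deletions.
                if k > 0 and i - 1 >= 0 and j - 1 - k >= 0:
                    pen = del_pen * k * X[i - 1].dcon
                    best = min(best, D[i - 1][j - 1 - k] + pen)
            D[i][j] = best + local
    return D[n - 1][m - 1]

entry = [EntryNote(0.0, 1.0), EntryNote(2.0, 1.0), EntryNote(-1.0, 1.0)]
exact = [MelodyNote(e.rp, e.rt, 1.0, 1.0) for e in entry]
# Same melody with one spurious note whose breakpoint confidence is low.
with_insertion = [exact[0], MelodyNote(5.0, 1.0, 0.2, 1.0), exact[1], exact[2]]
```

Because the spurious note has a low breakpoint confidence, skipping it is cheap, so the entry still scores close to an exact match.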

43. Apparatus for retrieving at least one entry from a music database,
wherein each said entry is associated with a sequence of pitches and beat durations, said
apparatus comprising:

means for receiving a digitized representation of an input melody;

a melody-to-note conversion subsystem for identifying breakpoints in said
melody in order to define notes therein, said subsystem determining a pitch and beat duration
for each note of said melody and associating each note or each breakpoint with a confidence
level;

a note-matching engine for determining a score for each said entry based on a
search which minimizes the cost of matching the pitches and beat durations of said melody
and said entry, wherein said search considers at least one deletion or insertion error in a
selected note of said melody and, in this event, penalizes the cost of matching based on the
confidence level of the selected note or a breakpoint associated therewith; and

an output subsystem for presenting said at least one entry to a user based on
its score.

44. A method of retrieving at least one entry from a music database,
wherein each said entry is associated with a sequence of pitches and beat durations, said
method comprising:

receiving a digitized representation of an input melody;

identifying breakpoints in said melody in order to define notes therein, each
said notes being delineated by adjacent breakpoints;

associating a confidence level with each note pertaining to likelihood that said
note contains a note insertion error;

determining a pitch and beat duration for each note of said melody;

determining a score for each said entry based on a search which minimizes the
cost of matching the pitches and beat durations of said melody and said entry, wherein said
search considers at least one insertion error in a selected note of said melody and, in this
event, penalizes the cost of matching based on the confidence level associated with the
selected note; and

presenting said at least one entry to a user based on its score.

45. A method of retrieving at least one entry from a music database,
wherein each said entry is associated with a sequence of pitches and beat durations, said
method comprising:

receiving a digitized representation of an input melody;

identifying breakpoints in said melody in order to define notes therein, each
said notes being delineated by adjacent breakpoints;

associating a confidence level with each note pertaining to likelihood that said
note contains a note deletion error;

determining a pitch and beat duration for each note of said melody;

determining a score for each said entry based on a search which minimizes the
cost of matching the pitches and beat durations of said melody and said entry, wherein said
search considers at least one deletion error in a selected note of said melody and, in this
event, penalizes the cost of matching based on the confidence level associated with the
selected note; and

presenting said at least one entry to a user based on its score.

46. A method for determining confidence levels for breakpoints or notes in
a waveform representing a melody, the method comprising:

segmenting the waveform into a series of frames, wherein adjacent
breakpoints encompass one or more sequential frames;

executing at least two of the following three steps,

(a) computing a spectral energy distribution (SED) indicator for each frame,

(b) estimating the pitch of each frame, and

(c) determining the energy level of each frame,

deriving the confidence levels based on at least two of the following three
characteristics, (i) the SED indicator, (ii) changes in pitch, and (iii) the energy level.

47. A method according to claim 46, wherein the confidence level for a
given breakpoint is computed as a weighted combination of at least two of three numbers, the
first number based on the value of the SED indicator in the vicinity of the given breakpoint,
the second number being based on a change in pitch in the frames before and after the given
breakpoint, and the third number being based on the energy level of the frames in the
immediate vicinity of the breakpoint.

48. A method according to claim 46, wherein the confidence level for a
given note is computed as a weighted combination of at least two of three numbers, the first
number based on the value of the SED indicator in the given note, the second number being
based on the variation in pitch in the given note, and the third number being based on the
energy level of the frames in the given note.

49. A method for determining confidence levels for breakpoints or notes in
a waveform representing a melody, the method comprising:

segmenting the waveform into a series of frames, wherein adjacent
breakpoints encompass one or more sequential frames;

computing a spectral energy distribution (SED) indicator for each frame;

estimating the pitch of each frame; and

deriving the confidence levels based on the SED indicator and changes in
pitch.

50. A method according to claim 49, wherein the confidence level for a
given breakpoint is computed as a weighted combination of a first number based on the value
of the SED indicator in the vicinity of the given breakpoint and a second number based on a
change in pitch in the frames before and after the given breakpoint.
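
One way to picture the weighted combination of claim 50 is the Python sketch below. The claims do not fix how either number is derived, so the choices here are assumptions for illustration: the first number is the largest SED value within `span` frames of the breakpoint (with SED values assumed to lie in 0 to 1), and the second is the pitch change across the breakpoint in semitones, scaled so that two semitones or more counts as 1.0. The function name and default weights are likewise hypothetical.

```python
import math

def breakpoint_confidence(sed, pitch, bp, w_sed=0.5, w_pitch=0.5, span=2):
    """Weighted combination of an SED-based and a pitch-change-based number.

    sed: per-frame SED indicator values, assumed to lie in [0, 1].
    pitch: per-frame pitch in Hz, or None for an invalid estimate.
    bp: frame index of the breakpoint.
    """
    lo, hi = max(0, bp - span), min(len(sed), bp + span + 1)
    sed_score = max(sed[lo:hi])  # first number: SED near the breakpoint
    before = [p for p in pitch[lo:bp] if p is not None]
    after = [p for p in pitch[bp:hi] if p is not None]
    if before and after:
        ratio = (sum(after) / len(after)) / (sum(before) / len(before))
        semitones = abs(12.0 * math.log2(ratio))
        pitch_score = min(1.0, semitones / 2.0)  # second number: pitch change
    else:
        pitch_score = 0.0  # assumption: no pitch evidence on one side
    return w_sed * sed_score + w_pitch * pitch_score

sed_values = [0.1, 0.1, 0.9, 0.1, 0.1]
pitch_values = [220.0, 220.0, 440.0, 440.0, 440.0]
confidence = breakpoint_confidence(sed_values, pitch_values, bp=2)
```

A breakpoint with both a strong SED peak and a clear pitch jump scores near 1.0, while one supported by neither scores near 0.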

51. A method according to claim 49, wherein the confidence level for a
given note is computed as a weighted combination of a first number based on the value of the
SED indicator within the given note and a second number based on the variation in pitch
within the given note.

52. A method according to claim 49, wherein the value of the SED
indicator for a given frame is relatively large if an energy distribution associated with the
frame is concentrated in one or more specified frequency bands.

53. A method according to claim 52, including filtering the melody with a
high pass filter prior to segmenting the melody into frames.

54. A method according to claim 53, wherein the energy distribution is 
determined from a normalized energy spectrum of the frame. 

55. A method according to claim 54, wherein the specified frequency band
is in the upper portion of a 0-4kHz frequency range.
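
Claims 52 to 55 describe an indicator that grows when a frame's normalized energy spectrum is concentrated in a specified band. A minimal Python sketch of one such indicator follows: the fraction of spectral energy falling in a chosen band. Taking 2-4 kHz as the "upper portion" of the 0-4 kHz range, the 8 kHz sample rate and the function name are assumptions for illustration, and a plain discrete Fourier transform stands in for whatever spectrum estimate an implementation would use. The high pass filtering of claim 53 is not shown.

```python
import cmath
import math

def sed_indicator(frame, sample_rate=8000, band=(2000.0, 4000.0)):
    """Fraction of the frame's spectral energy inside the given band (0.0 to 1.0)."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to sample_rate / 2
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energies.append(abs(coeff) ** 2)
    total = sum(energies)
    if total == 0.0:
        return 0.0  # silent frame: no energy to distribute
    in_band = sum(e for k, e in enumerate(energies)
                  if band[0] <= k * sample_rate / n <= band[1])
    return in_band / total  # dividing by the total normalizes the spectrum

# Usage: 64-sample frames holding a high tone and a low tone.
high_tone = [math.sin(2 * math.pi * 3000 * t / 8000) for t in range(64)]
low_tone = [math.sin(2 * math.pi * 500 * t / 8000) for t in range(64)]
```

Calling `sed_indicator(high_tone)` gives a value close to 1, while `sed_indicator(low_tone)` gives a value close to 0, which is the contrast the indicator relies on.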

56. A method for determining confidence levels for breakpoints or notes in
a waveform representing a melody, the method comprising:

segmenting the waveform into a series of frames, wherein adjacent
breakpoints encompass one or more sequential frames;

computing a spectral energy distribution (SED) indicator for each frame;

determining the energy level of each frame; and

deriving the confidence levels based on the SED indicator and the energy
level.

57. A method according to claim 56, wherein the confidence level for a
given breakpoint is computed as a weighted combination of a first number based on the
value of the SED indicator in the vicinity of the given breakpoint and a second number based
on the energy level of the frame in the immediate vicinity of the breakpoint.

58. A method according to claim 56, wherein the confidence level for a
given note is computed as a weighted combination of a first number based on the value of the
SED indicator in the given note and a second number based on the energy level of the frames in
the given note.

59. A method according to claim 56, wherein the value of the SED 
indicator for a given frame is relatively large if an energy distribution associated with the 
frame is concentrated in one or more specified frequency bands. 

60. A method according to claim 59, including filtering the melody with a
high pass filter prior to segmenting the melody into frames.

61. A method according to claim 60, wherein the energy distribution is
determined from a normalized energy spectrum of the frame.

62. A method according to claim 61, wherein the specified frequency band
is the upper portion of a 0-4kHz frequency range.